Text File | 1994-11-09 | 22KB | 522 lines
A SURAnet guide to the Archie service (v1.0.1)
by Eric Anderson
This document may be converted to another format for use on different
machines, and aggregated with other files for distribution. But,
please do not modify the content; sections of this document should
not be removed or modified. Please address questions on this policy,
requests for exceptions, comments, or suggestions to archie-admin@sura.net
This document is available from ftp.sura.net as
pub/archie/docs/SURAnet-archie-guide.txt
Table of Contents
----- -- --------
1.0 Introduction
2.0 Data Stored In The Archie System
3.0 Searching Through the Database
3.1 Exact Searches
3.2 Case Sensitive Searches
3.2.1 Examples of Case Sensitive Searches
3.3 Case Insensitive Searches
3.3.1 Examples of Case Insensitive Searches
3.4 Regular Expression Searches
3.4.1 Examples of Regular Expression Searches
3.4.2 Converting from File Matching Expressions to Archie Regexps
3.4.3 Examples of Converted File Matching Expressions
4.0 Different Access Methods
4.1 Remote Client Access
4.2 Interactive Interfaces
4.3 Mail Interface
Appendix A: Regular Expressions
A.1 Overview
A.2 Using Regular Expressions
A.3 Matching an arbitrary character
A.4 Matching repeated characters
A.5 Matching strings at the front or back
A.6 Matching a character out of a set of choices
A.7 Forcing special characters to be normal
Appendix B: Glossary of Terms
1.0 Introduction
--- ------------
The Archie service provides a database of files available for anonymous
ftp, a way of retrieving files that are publicly available on the Internet.
Users can locate files by searching the Archie database, which is indexed by
the names of the files. Therefore, users need to know a part of the file's
name to locate a file using Archie.
This document describes:
* The data in the archie database, so that people can understand what
queries will produce useful results;
* The different search methods, with examples;
* The different programs that can search the Archie database;
* A description of Archie regular expressions.
2.0 Data Stored In The Archie System
--- ---- ------ -- --- ------ ------
The Archie system stores the names of files available for anonymous ftp.
Therefore searches of the database need to use the name of the file,
not words associated with the file. For example, a search for the
DOS version of the GNU C compiler requires knowing that the name of that
package is djgpp. The Archie system acquires the names of files stored at
anonymous ftp sites by ftp-ing to the site and getting a listing of all the
files stored there, so files indexed in the database use the naming
conventions of each individual site instead of a common naming convention.
To reduce the load on anonymous ftp sites, the Archie system retrieves
the listings of files infrequently. Different Archie sites have different
policies on how many sites they retrieve and how often they retrieve site
listings, but usually about 50 listings are retrieved a night. Since there
are over 1000 sites, each site is updated about once every 20 days.
Furthermore, the different Archie sites are on different rotations for
retrieving listings. Therefore, different sites can return different
results, and newly added files are unlikely to be indexed in Archie until
a few days after they are made available.
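The update interval quoted above is a simple division. A small sketch of the
arithmetic, using only the round figures from the text (both are
approximations, and the variable names are illustrative):

```python
# Round figures taken from the paragraph above; real Archie sites varied.
sites_indexed = 1000       # "over 1000 sites"
listings_per_night = 50    # "about 50 listings are retrieved a night"

# Nights needed to visit every site once in a simple rotation.
nights_per_cycle = sites_indexed / listings_per_night
print(f"each site is refreshed about every {nights_per_cycle:.0f} days")
```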
3.0 Searching Through the Database
--- --------- ------- --- --------
There are four main ways of searching through the database:
* Exact Searching (exact)
-- Very fast, returns only exact filename matches
* Case Sensitive Substring Searches (subcase)
-- Medium fast, returns filenames with the substring in the filename
* Case Insensitive Substring Searches (substring)
-- Medium slow, returns filenames with the substring in any case in
the filename
* Regular Expressions (regexp)
-- Slow, returns filenames which match the regular expression anywhere
in the string
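The four search types can be imitated on a local list of filenames. The
following is only a sketch of the matching rules described above, not the
Archie server's code: the function name archie_match is hypothetical, and
Python's re module stands in for Archie's regular expression matcher.

```python
import re

def archie_match(mode, pattern, filename):
    """Return True if filename matches pattern under the given search mode."""
    if mode == "exact":        # whole filename must be identical
        return filename == pattern
    if mode == "subcase":      # case sensitive substring
        return pattern in filename
    if mode == "substring":    # case insensitive substring
        return pattern.lower() in filename.lower()
    if mode == "regexp":       # regular expression matched anywhere in the name
        return re.search(pattern, filename) is not None
    raise ValueError(f"unknown search mode: {mode}")

names = ["TeX-3.14.tar.gz", "macros.tex", "README.TeX", "latex-style"]
searches = [("exact", "macros.tex"), ("subcase", "TeX"),
            ("substring", "TeX"), ("regexp", r"^TeX.*\.tar")]
for mode, pattern in searches:
    hits = [n for n in names if archie_match(mode, pattern, n)]
    print(mode, pattern, "->", hits)
```

Running it shows the substring search returning every name in the list,
while the subcase search returns only the names containing TeX in exactly
that case.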
3.1 Exact Searching
--- ----- ---------
Exact searching is useful if the name of the desired program is known, and
the user is looking for the closest copy.
For example, a search for version 2.0.9 of xarchie using the string
xarchie-2.0.9.tar.gz would find the file on 15 different hosts spread across
the world. A search for the same file using the string
xarchie-2.0.9.tar.Z would find the file on 25 different hosts.
3.2 Case sensitive substring searching (subcase)
--- ---- --------- --------- --------- ---------
Subcase searching is useful if part of the filename is known, but (for
example) not the most current version. Subcase searching is more precise
than substring searching if the case of the part of the filename is known;
subcase searching will reduce the number of irrelevant matches. A search
for the most recent version of the TeX package could use the search string
TeX, which would find files which have the string TeX in them, but not those
with only the string tex in them.
3.2.1 Examples of Case Sensitive Searches
----- -------- -- ---- --------- --------
Search String   Returned Names
------ ------   -------- -----
TeX-            TeX-3.14.tar.gz, SeeTeX-2.18.5.tar.Z, TeX-index,
                GETTING-TeX-FILES, ...
gcc-2.5         gcc-2.5.0-2.5.0a.diff.gz, gcc-2.5.0.tar.gz, gcc-2.5.2.tar.gz,
                gcc-2.5.0-cpp.ps.gz, gcc-2.5.3.tar.gz, not.gcc-2.5.0.tar.gz,
                gcc-2.5.0-2.5.1.diff, ...
3.3 Substring Searching
--- --------- ---------
Substring searching is useful if a portion of the filename is known, but not
the most current version or the case of the filename. In the previous example
of searching for TeX, a substring search would find filenames with either TeX
or tex in them.
3.3.1 Examples of Case Insensitive Searches
----- -------- -- ---- ----------- --------
Search String   Returned Names
------ ------   -------- -----
tex             macros.tex, bib.tex, README.TeX, syd.tex, LNM.tex, ...
TeX             macros.tex, bib.tex, README.TeX, syd.tex, LNM.tex, ...
TeX-            revtex-30.hqx, videotex-terminal-tool.hqx, latex-style,
                jtex-1.43-1.44.tar.gz, jtex-util.tar.Z, psfig-tex-1.4.tar.Z,
                bibtex-style, latex-style-misc.ann, ...
xf-2            Announcing-XF-2-2.z, xf-2.2.tar.gz, announcing-xf-2-2, ...
3.4 Regular Expression Searches
--- ------- ---------- --------
Regular expressions allow specifying that the filename begin with a specific
string, or specifying two sections of the filename separated by some unknown
characters. Regular expressions are different from file matching expressions
used by the shell at the command line; a conversion process is provided in
section 3.4.2. A complete description of archie regular expressions is in
Appendix A.
3.4.1 Examples of Regular Expression Searches
----- -------- -- ------- ---------- --------
Searching for complete distributions of emacs 19:
    ^emacs.*19[^-]*\.tar
        Which will find filenames starting with emacs, having
        19 somewhere after that, and then any characters but a -
        until a .tar, e.g.: emacs-19.15.tar.gz,
        emacs-19.16.tar.gz, emacs-19.17.tar.gz, ...
Searching for diffs between versions of emacs 19:
    ^emacs.*19.*-.*\.tar
        Which will find filenames starting with emacs, having
        19 somewhere after that, then a dash somewhere after that
        and finally a .tar, e.g.: emacs-19.16-A.bin.tar.gz,
        emacs-19.16-alpha.src.tar.Z, emacs-19.15-A.el.2of2.tar.gz,
        emacs-19.15-A.bin.tar.gz, ...
Since the above expression didn't do what I expected:
    ^emacs.*19\.[0-9].*-.*[0-9][0-9]
        Which finds filenames starting with emacs, having
        19.<digit> somewhere after that, then some number of
        characters, a dash, then some more characters,
        then two digits, which matches: emacs-19.12-19.13.diff.gz,
        emacs-19.11-19.12.diff.gz, emacs-19.10-19.11.diff.gz, ...
Searching for a tar version of a package:
    ^xce.*tar
        Which will find filenames starting with xce and having
        tar somewhere after that, for example xce-1.00.tar.Z,
        xcell.tar.z and xce.tar.gz
Searching for current sun bug fixes:
    [0-9][0-9][0-9][0-9][0-9][0-9]-[0-9][0-9]\.tar
        Which will find filenames which contain six digits,
        then a dash, two digits, a period and then the string
        tar, for example 100075-09.tar.Z and 100149-03.tar.Z
Searching for the third major revision of xv:
    ^xv.*3.*tar
        Which will find filenames starting with xv, some
        number of characters, the number three, some number
        of characters and then the string tar, for example,
        xv-3.00.tar.gz, xv-2.21.386bsd.bin.tar.Z, and
        xview-3.part3.tar.gz
A better version of the above:
    ^xv-3\..*tar
        Which forces the first five characters to be xv-3., and
        then the string tar at some point later. Which finds:
        xv-3.00a.tar.Z and xv-3.00.tar.Z
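Expressions like these can be tried out locally before a query is sent. The
sketch below uses Python's re module as a stand-in for Archie's matcher; that
substitution is an assumption, though the simple constructs used here
( . * ^ $ [ ] and \ ) behave the same way in both. The filenames are the
ones quoted in the examples above.

```python
import re

def matches(pattern, name):
    # re.search, like Archie, accepts a match anywhere in the filename.
    return re.search(pattern, name) is not None

# (pattern, names the text says it finds, names it should not find)
cases = [
    (r"^emacs.*19[^-]*\.tar",
     ["emacs-19.15.tar.gz", "emacs-19.17.tar.gz"],
     ["emacs-19.16-A.bin.tar.gz"]),
    (r"^emacs.*19\.[0-9].*-.*[0-9][0-9]",
     ["emacs-19.12-19.13.diff.gz"],
     ["emacs-19.15.tar.gz"]),
    (r"[0-9][0-9][0-9][0-9][0-9][0-9]-[0-9][0-9]\.tar",
     ["100075-09.tar.Z", "100149-03.tar.Z"],
     ["xv-3.00.tar.gz"]),
    (r"^xv-3\..*tar",
     ["xv-3.00a.tar.Z", "xv-3.00.tar.Z"],
     ["xv-2.21.386bsd.bin.tar.Z", "xview-3.part3.tar.gz"]),
]

for pattern, expected, rejected in cases:
    for name in expected:
        assert matches(pattern, name), (pattern, name)
    for name in rejected:
        assert not matches(pattern, name), (pattern, name)
    print(pattern, "behaves as described")
```

The last case shows why the tighter xv expression is better: the two
unwanted names that ^xv.*3.*tar picks up are rejected by ^xv-3\..*tar.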
3.4.2 Converting from File Matching Expressions to Archie Regexps
----- ---------- ---- ---- -------- ----------- -- ------ -------
File matching expressions, or file globs, are typed in to the shell or
command line to match filenames. Common examples are *.txt, *.exe, x*,
p*.zip, etc. The following rules should convert file globs to Archie
regexps. An explanation of what each symbol does can be found in Appendix A.
1. Prepend ^ to the file glob, and append $ to the end
2. Replace all occurrences of . with \.
3. Replace all single character matches, such as ? with .
4. Replace all multiple character matches, e.g. * with .*
Note that regular expression matches are case sensitive. If arbitrary
cases for each letter should match, then each letter needs to be replaced
with the upper and lower case versions of that letter in brackets. E.g.
replace b with [bB]
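Writing the bracket pairs by hand is tedious for longer names. A minimal
sketch of the replacement, assuming the input is a literal piece of a
filename rather than an existing regexp (the function name any_case_regexp
is hypothetical):

```python
import string

# Characters with special meaning in Archie regular expressions.
SPECIAL = set(".^*$\\[]")

def any_case_regexp(literal):
    """Turn a literal string into a regexp that matches it in any case."""
    parts = []
    for ch in literal:
        if ch in string.ascii_letters:
            # Each letter becomes a two-character set, e.g. b -> [bB].
            parts.append(f"[{ch.lower()}{ch.upper()}]")
        elif ch in SPECIAL:
            # Special characters are escaped so they match themselves.
            parts.append("\\" + ch)
        else:
            parts.append(ch)
    return "".join(parts)

print(any_case_regexp("b"))
print(any_case_regexp("tex-3.1"))
```

The second call prints [tT][eE][xX]-3\.1, which can then be combined with
the other regexp pieces by hand.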
If a command line interface is used for searching, then single quotes (')
probably need to surround the regular expression argument.
If neither the single nor multiple character matches have been used, then
the search can probably be performed using a substring or subcase search,
which should be faster than the regular expression search.
3.4.3 Examples of Converted File Matching Expressions
----- -------- -- --------- ---- -------- -----------
File Expression     Archie Regexp
---- ----------     ------ ------
*.txt               ^.*\.txt$
*.exe               ^.*\.exe$
x*                  ^x.*$
p*.zip              ^p.*\.zip$
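The four conversion rules in section 3.4.2 can be sketched as a short
function. This is an illustration only: the name glob_to_archie_regexp is
hypothetical, and the sketch handles just the . ? and * characters covered
by the rules, not bracket globs or names containing other special characters.

```python
def glob_to_archie_regexp(glob):
    """Apply the four conversion rules to a simple file glob."""
    body = []
    for ch in glob:
        if ch == ".":
            body.append("\\.")   # rule 2: a period must be escaped
        elif ch == "?":
            body.append(".")     # rule 3: single character match
        elif ch == "*":
            body.append(".*")    # rule 4: multiple character match
        else:
            body.append(ch)      # everything else is copied unchanged
    # rule 1: anchor the expression at both ends
    return "^" + "".join(body) + "$"

for glob in ["*.txt", "*.exe", "x*", "p*.zip"]:
    print(glob, "->", glob_to_archie_regexp(glob))
```

Working through the glob one character at a time avoids an ordering trap:
if the rules were applied as successive whole-string replacements in a
different order, the . produced by rule 3 could be escaped by rule 2.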
4.0 Different access methods
--- --------- ------ -------
The Archie database resides on a number of different machines on the Internet.
A server on each machine allows users to query the database from their
machines. Client programs contact the server to
search the database. By separating the client from the server, the load
on the server is reduced, allowing faster processing of the searches.
4.1 Remote Client Access
--- ------ ------ ------
Remote client access is the preferred method to access the database. Since
the Archie machine need only process the query, it has to do much less work
than if the telnet or mail interfaces are used. This is especially true for
version 2.X of the Archie system which handles interactive and mail queries
very inefficiently.
All version numbers and locations are the best I know of as of
Nov 14, 1993. Archie can be used to locate newer versions, or find copies
which are stored closer to your site. As the clients are updated,
the version numbers will change. Therefore, after finding where a copy
of a program is stored, check to make sure that the most recent version
is retrieved.
There are two main programs which provide client access, the c-archie client
and the xarchie program. There is a version of the c-archie client which has
been compiled for VMS machines.
The c-archie client can be retrieved from ftp.sura.net as
pub/archie/clients/c-archie-1.4.1-FIXED.tar.Z
The VMS version of the c-archie client can be retrieved from ftp.sura.net as
/pub/archie/clients/c-archie-1.3.2-vms.com
The xarchie client can be retrieved from ftp.x.org
as contrib/xarchie-2.0.9.tar.Z
There is also a perl archie client, available from ftp.sura.net
as pub/archie/clients/perl-archie-3.8.tar.Z
A NeXT archie client can be retrieved from ashley.cs.widener.edu as
pub/archie/archie-NeXT.tar.Z
A mac archie client is available from
mac.archive.umich.edu as /mac/util/comm/anarchie1.00.sit.hqx
There appears to be a mac client available from
pprg.eece.unm.edu as /pub/Mac/sumex/comm
4.2 Interactive Interfaces
--- ----------- ----------
The interactive interface should be used only if the remote clients
are unavailable. Users can telnet to any archie site and login as archie.
On the host archie.sura.net, users can log in as qarchie, which provides
a faster interactive interface.
4.3 Mail Interface
--- ---- ---------
Mail clients should be used as a last resort. By sending e-mail to
archie@<archie-site> with the word help in the body or subject, users can
receive a file which explains how to use the mail interface to archie.
Appendix A: Regular Expressions
-------- -- ------- -----------
A.1 Overview
--- --------
Regular expressions are very powerful ways of describing filenames.
They allow the following features:
* Forcing certain characters to be repeated exactly n times, or at least
n times, where n >= 0
* Specifying that a string should occur at the front or back of a
matched filename
* Specifying sets of characters which can match
The following characters are special in Archie regular expressions:
. ^ * $ \ [ ]
The regular expressions used in Archie are known as ed regular expressions
because they were derived from the ed editor which is found on UNIX
workstations. ed regular expressions are different from the file matching
expressions that are used at the command line. Section 3.4.2 describes
how to convert file matching expressions to archie regexps.
A.2 Using Regular Expressions
--- ----- ------- -----------
Individually, many of these capabilities are not useful or can be better
handled as another type of search. However, in combination they can
accurately specify the set of names desired, lowering the number of
useless matches. I usually think about regular expressions as a series
of pieces of a filename stuck together, allowing me to understand what
filenames will be matched.
A.3 Matching an arbitrary character
--- -------- -- --------- ---------
To match an arbitrary character, the . symbol is used. For example the
regular expression ........ will match all filenames which have at least
eight characters in them, because Archie regular expressions match
any part of a filename.
A.4 Matching something repeated multiple times
--- -------- --------- -------- -------- -----
A large amount of power in regular expressions comes from the ability to
match a specified set of characters repeated an arbitrary number of times.
This is more powerful than file globs which cannot specify a set. The *
character means that the previous element in a regular expression should be
matched an arbitrary number of times. For example to match something which
has ctwm and tar in the filename, separated by some set of arbitrary
characters, the regular expression ctwm.*tar would be used. To find the
versions of c2man which were archived from one of the comp.sources groups,
the regular expression c2man.*[0-9][0-9]* would be used. It would find all
filenames with c2man in the string, followed by some or no characters,
followed by at least one digit. This search would match filenames like
c2man-2.0.13.tar.gz, c2man-2.03, c2man-1.10.tar.Z
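The two expressions in this section can be tried against a few names. This
sketch uses Python's re module in place of Archie's matcher (an assumption;
the * construct behaves the same in both). Two of the filenames come from
the text and two are hypothetical, as marked.

```python
import re

def found(pattern, name):
    # re.search, like Archie, accepts a match anywhere in the name.
    return re.search(pattern, name) is not None

names = [
    "c2man-2.0.13.tar.gz",   # from the text
    "c2man-2.03",            # from the text
    "c2man.README",          # hypothetical: no digit after c2man
    "ctwm-3.0.tar.Z",        # hypothetical ctwm archive
]

# [0-9][0-9]* means one digit followed by zero or more further digits.
c2man_hits = [n for n in names if found(r"c2man.*[0-9][0-9]*", n)]
ctwm_hits = [n for n in names if found(r"ctwm.*tar", n)]
print(c2man_hits)
print(ctwm_hits)
```

Note that the 2 inside the name c2man does not count toward the match,
because the digit has to appear after the string c2man.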
A.5 Anchoring strings at the front or the back
--- --------- ------- -- --- ----- -- --- ----
To anchor a particular string to the front or the back of a matching
filename, a ^ is put at the front or a $ at the back. For example, to find
zip files, the search string zip$ would be used. To find filenames starting
with gdb, the string ^gdb would be used. These options can be combined: the
search string ^foo$ would match a file which has precisely the name foo. This
particular search would be better as an exact search because the results would
be returned faster. However, a search for compressed versions of gcc, using the
regexp ^gcc.*tar\.Z$, would be a good use of the power of regular expressions.
In the earlier example of ........ , filenames of exactly eight characters
could be matched using ^........$
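The effect of the anchors can be seen by running the same names through an
anchored and an unanchored expression. In this sketch Python's re module
stands in for Archie's matcher (an assumption), and every filename is a
hypothetical illustration.

```python
import re

def found(pattern, name):
    # re.search, like Archie, accepts a match anywhere in the name,
    # so only ^ and $ tie the expression to the ends.
    return re.search(pattern, name) is not None

print(found(r"^gdb", "gdb-4.11.tar.gz"))   # gdb at the front
print(found(r"^gdb", "xgdb.tar.Z"))        # gdb present, but not at the front
print(found(r"zip$", "tools.zip"))         # zip at the back
print(found(r"^foo$", "foobar"))           # both anchors: whole name must be foo

candidates = ["filename", "filenames", "short"]
eight_or_more = [n for n in candidates if found(r"........", n)]
exactly_eight = [n for n in candidates if found(r"^........$", n)]
print(eight_or_more)
print(exactly_eight)
```

The unanchored ........ accepts both filename and filenames, while the
anchored form accepts only the eight-character name.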
A.6 Matching a character out of a set of possibilities
--- -------- - --------- --- -- - --- -- ------------
There are two ways to specify characters to be matched: as a list
of possibilities, or as a list of unacceptable possibilities.
To match any digit, use the regular expression [0-9], which says match any
character from 0 to 9. The expression [a-zA-Z] would match any alphabetic
character, and the expression [aeiou] would match any lowercase vowel.
To match characters not in a set, the ^ character is placed after the left
bracket. To find a file which contains tar, then any characters,
then a non-alphabetic character, any further characters and then tar,
the regular expression tar.*[^a-zA-Z].*tar could be used.
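A short sketch of both kinds of set, again using Python's re module as a
stand-in for Archie's matcher (an assumption). All filenames here are
hypothetical. Because the last pattern has no ^ in front, the first tar may
appear anywhere in the name.

```python
import re

def found(pattern, name):
    # re.search, like Archie, accepts a match anywhere in the name.
    return re.search(pattern, name) is not None

print(found(r"[0-9]", "README"))        # no digit anywhere in the name
print(found(r"[0-9]", "gcc-2.5"))       # contains a digit
print(found(r"[aeiou]", "xyz.Z"))       # no lowercase vowel

# A negated set: ^ right after the left bracket means "anything but these".
pattern = r"tar.*[^a-zA-Z].*tar"
results = {name: found(pattern, name)
           for name in ["tar-1.11.tar", "tartar", "star.tar"]}
print(results)
```

Here tartar is rejected because nothing but letters sits between the two
occurrences of tar, while tar-1.11.tar and star.tar both contain a
non-alphabetic character in between.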
A.7 Forcing special characters to be normal
--- ------- ------- ---------- -- -- ------
. ^ * $ \ [ ] are characters with special meanings in Archie
regular expressions. To specify that these characters are exactly matched,
the special character needs to be escaped, by putting the \ character in
front of it. For example, the expression \.tar$ would find filenames ending
in .tar, but the expression .tar$ could also find filenames ending in ftar,
because the unescaped . matches any character.
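The difference between the escaped and unescaped period can be checked
directly. As elsewhere, Python's re module is used here as a stand-in for
Archie's matcher (an assumption), and the filenames are hypothetical.

```python
import re

def found(pattern, name):
    # re.search, like Archie, accepts a match anywhere in the name.
    return re.search(pattern, name) is not None

# For each name: (matches \.tar$, matches .tar$)
results = {name: (found(r"\.tar$", name), found(r".tar$", name))
           for name in ["archive.tar", "ftar", "guitar"]}
for name, (escaped, unescaped) in results.items():
    print(name, "escaped:", escaped, "unescaped:", unescaped)
```

Only archive.tar satisfies the escaped form; the unescaped form also
accepts ftar and guitar, since any character may stand before tar.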
Appendix B: Glossary
-------- -- --------
anonymous ftp
Anonymous ftp is a way of accessing publicly available files.
Normally you would use the ftp command with the user name anonymous.
It is customary to give your e-mail address as the password so that people
will know who is retrieving files; indeed, some sites require a valid
e-mail address before allowing you to retrieve files.
anchoring
Anchoring means forcing a section of a filename to match to either the
front or the back of the filename, i.e. the string is anchored to the
front[back] of the filename.
case
Upper(A-Z) or lower(a-z) case. Case insensitive searches consider A to
be equivalent to a, B to b, etc.
client
A program which runs on one machine and accesses a server to gather some
form of information.
command line
Where commands are typed in to the prompt. Usually the program that
receives the commands is called the shell.
compress
A program which removes redundancy, producing a shorter output file.
diffs
Differences. When a package is upgraded to a new version, authors usually
provide the differences between the old and new version of the package
because the differences are smaller, and because by applying the differences
to their own copy of the sources, users are not forced to re-make any local
changes.
ed
One of the first editors made available on unix workstations. The
regular expressions used in ed are very similar to the ones used in archie.
e-mail
Electronic Mail. A common way of communicating across the Internet.
file globs
See file matching expressions.
file matching expressions (file globs)
Expressions that are typed into the shell at the command line to specify
files. File globs are useful for specifying multiple files.
host
see site
Internet
The Internet is the collection of hosts connected through the NSFnet
backbone, which started as a DARPA project. The Internet now reaches
sites across the world.
package
A collection of files that make up a program. Usually the sources to a
particular program, but the parts of a package can include data files,
binaries, etc.
regular expressions
A name for expressions which can be matched by finite automata, a "machine"
with a finite number of states which can change from one state to another
given a single character of input.
regexps
see regular expressions
shell
A program which parses file globs and executes programs. Usually a shell
has other features such as input/output redirection, repeating, job
control, etc. See also command line.
server
The part of a server-client system which receives requests from the client,
processes those requests and returns the results to the client. There is
an archie server running on the machines which provide archie service.
subcase
Case sensitive substring matches, which means that the filename matched must
have the same case as the search string. See also substring and case.
substring
Case insensitive substring matches, which means that the filename matched
can have any case relative to the search string provided that the letters
are the same. See also subcase and case.
site
Usually a machine on the Internet, for example the anonymous ftp site
ftp.sura.net. Sometimes generalized to mean a group of machines,
for example the Carnegie Mellon site.
tar
A program which gathers a collection of files together into one file
for transmission or storage. tar preserves the names and subdirectories
of the gathered files.